Recap of Data Wrangling with dplyr

STAT 331

Ugliest Plot

https://docs.google.com/presentation/d/19u5djgMsPLtxoM-rfAQuLyP89B4nYdW8uBhNAEJQoS8/edit?usp=sharing

The tidyverse Philosophy

A Vignette

subset()

Return subsets of vectors, matrices or data frames which meet conditions.

The subset argument states how the rows of the data frame should be filtered.

subset(surveys, 
       subset = 
         species_id == "DS")

The select argument states which columns should be selected from the data frame.

subset(surveys, 
       subset = species_id == "DS", 
       select = c(weight, 
                  hindfoot_length)
       )

We want functions that accomplish one task!

We want functions with intuitive names!
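As a rough sketch of what "one task per function" looks like, the subset() call above can be split into two dplyr verbs. The `surveys_toy` tibble below is a hypothetical stand-in for `surveys` with made-up values, used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for `surveys`; the values are invented for illustration.
surveys_toy <- tibble(
  species_id      = c("DS", "DM", "DS", "PF"),
  weight          = c(120, 40, 115, 8),
  hindfoot_length = c(50, 36, 49, 16)
)

# Base R: a single function filters rows AND selects columns.
base_result <- subset(surveys_toy,
                      subset = species_id == "DS",
                      select = c(weight, hindfoot_length))

# dplyr: one verb per task, connected with the pipe.
dplyr_result <- surveys_toy |>
  filter(species_id == "DS") |>      # keep only the DS rows
  select(weight, hindfoot_length)    # keep only these two columns
```

Both versions should return the same rows and columns; the dplyr version simply names each step.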

Data Wrangling Verbs

filter()

select()

mutate()

summarize()

arrange()

group_by()

Brainstorm definitions for each verb

filter()

select()

mutate()

group_by()

summarize()

arrange()

The Pipe |>

Preview Activity Review

Suppose we would like to study how the ratio of penguin body mass to flipper size differs across the species. Arrange the following steps into an order that accomplishes this goal (assuming the steps are connected with a |>).

arrange(med_mass_flipper_ratio)

group_by(species)

penguins

summarize(med_mass_flipper_ratio = median(mass_flipper_ratio))

mutate(mass_flipper_ratio = body_mass_g / flipper_length_mm)
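One ordering of these steps that works is sketched below. It assumes `penguins` comes from the palmerpenguins package, and the `na.rm = TRUE` argument is an addition not shown in the steps above; it is included because that dataset has missing measurements, which would otherwise make the medians NA.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` data

ratio_by_species <- penguins |>
  # 1. Create the ratio for every penguin
  mutate(mass_flipper_ratio = body_mass_g / flipper_length_mm) |>
  # 2. Declare that later summaries happen per species
  group_by(species) |>
  # 3. Collapse each species to its median ratio (na.rm is an assumption)
  summarize(med_mass_flipper_ratio = median(mass_flipper_ratio, na.rm = TRUE)) |>
  # 4. Sort species from smallest to largest median ratio
  arrange(med_mass_flipper_ratio)

ratio_by_species
```

The mutate() step has to come before summarize(), since the median is taken over a column that does not exist until it is created.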

A Different Context

You have data on each Cal Poly student for the 2020-2021 academic year. You are tasked with reporting how the number of CR/NC courses students take differs based on department.

name                   department          CRNC_f20  CRNC_w21  CRNC_s21
Gonzales, Yasmin       Business                   1         3         0
al-Hossain, Misbaah    Biology                    2         2         1
Hyland, Cassidy        Liberal Studies            0         0         1
Landry, Conner         Political Science          2         0         0
Lai Zhou, Meghan       Business                   0         0         2
Navarrete, Guadalupe   Business                   0         1         2
Yahashi, Hannah        Liberal Studies            0         1         0
Mcbroom, Gabrielle     Biology                    1         1         1
Hepp, Kayla            Political Science          0         2         4
Yost, Aubrey           Chemistry                  1         0         2

What data wrangling operations would you use?

What order would you use to accomplish this goal?

Problem Statement:

Department totals for number of CR / NC courses

Step 1: Get totals for each student

Step 2: Get department totals

Step 3: Arrange the totals
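The three steps can be sketched with the verbs from earlier, using the ten students listed in the table. The object and column names `crnc`, `CRNC_total`, and `dept_total` are illustrative choices, and sorting in descending order is an assumption about how the totals should be arranged.

```r
library(dplyr)

# The ten rows from the table above
crnc <- tribble(
  ~name,                  ~department,         ~CRNC_f20, ~CRNC_w21, ~CRNC_s21,
  "Gonzales, Yasmin",     "Business",          1, 3, 0,
  "al-Hossain, Misbaah",  "Biology",           2, 2, 1,
  "Hyland, Cassidy",      "Liberal Studies",   0, 0, 1,
  "Landry, Conner",       "Political Science", 2, 0, 0,
  "Lai Zhou, Meghan",     "Business",          0, 0, 2,
  "Navarrete, Guadalupe", "Business",          0, 1, 2,
  "Yahashi, Hannah",      "Liberal Studies",   0, 1, 0,
  "Mcbroom, Gabrielle",   "Biology",           1, 1, 1,
  "Hepp, Kayla",          "Political Science", 0, 2, 4,
  "Yost, Aubrey",         "Chemistry",         1, 0, 2
)

dept_totals <- crnc |>
  # Step 1: total CR/NC courses for each student across the three quarters
  mutate(CRNC_total = CRNC_f20 + CRNC_w21 + CRNC_s21) |>
  # Step 2: add up the student totals within each department
  group_by(department) |>
  summarize(dept_total = sum(CRNC_total)) |>
  # Step 3: arrange departments from most to fewest CR/NC courses
  arrange(desc(dept_total))

dept_totals
```

With these ten students, Business has the largest total (9) and Liberal Studies the smallest (2).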

Getting Specific

Often you are interested in one specific summary statistic!

# A tibble: 5 × 2
# Groups:   department [5]
  department            n
  <chr>             <int>
1 Business              3
2 Biology               2
3 Liberal Studies       2
4 Political Science     2
5 Chemistry             1

# A tibble: 1 × 2
# Groups:   department [1]
  department            n
  <chr>             <int>
1 Political Science     2

A Handy Tool

pull()

  • Extracts entries from dataframes

[1] 2
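The slides do not show the code behind these outputs, so the following is only one way to get from the full table of counts to a single number. It uses count() and filter() to narrow down to Political Science, then pull() to extract the value; the `departments` tibble holds just the department column from the student table.

```r
library(dplyr)

# Department of each of the ten students, in table order
departments <- tibble(
  department = c("Business", "Biology", "Liberal Studies", "Political Science",
                 "Business", "Business", "Liberal Studies", "Biology",
                 "Political Science", "Chemistry")
)

n_polisci <- departments |>
  count(department) |>                          # one row per department, with n
  filter(department == "Political Science") |>  # keep the one row of interest
  pull(n)                                       # extract n as a plain vector

n_polisci
```

Without pull(), the result is still a 1 × 2 tibble; with it, the result is a bare vector that can be used directly in inline text or further calculations.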

Your Turn: PA Exploration

  • Find your group
  • Introduce yourself
  • Decide on team roles (will change each week)
    • Reporter – Types the solutions
    • Editor – Asks the professor the team's questions, double-checks what is typed
    • Facilitator – Leads discussion, makes sure everyone understands the task
    • Captain – Encourages participation, enforces norms, brings conversation back if it deviates